
JAVA-3051: Memory leak #1743

Open · wants to merge 24 commits into base: 4.x

Conversation

SiyaoIsHiding (Contributor)

No description provided.

@SiyaoIsHiding SiyaoIsHiding marked this pull request as ready for review November 2, 2023 22:14
@SiyaoIsHiding (Contributor Author)

We explored several alternatives for unit testing this memory leak fix but decided to give up on testing it, as we could not find a reliable way to test the garbage collector's behavior.

The unit test we wrote and considered:

public void should_garbage_collect_without_strong_references() {
  // given that
  given(nodeDistanceEvaluator.evaluateDistance(weakNode1, null)).willReturn(NodeDistance.IGNORED);
  given(nodeDistanceEvaluator.evaluateDistance(weakNode2, null)).willReturn(NodeDistance.IGNORED);
  // weak references to poke the private WeakHashMap in LoadBalancingPolicyWrapper.distances
  WeakReference<DefaultNode> weakReference1 = new WeakReference<>(weakNode1);
  WeakReference<DefaultNode> weakReference2 = new WeakReference<>(weakNode2);
  wrapper.init();
  // remove all the strong references, including the ones held by Mockito
  weakNode2 = null;
  reset(metricsFactory);
  reset(distanceReporter);
  reset(nodeDistanceEvaluator);
  reset(metadata);
  // verify
  System.gc();
  assertThat(weakReference1.get()).isNotNull();
  await().atMost(10, TimeUnit.SECONDS)
      .until(() -> weakReference2.get() == null);
}

This test:

  1. creates two DefaultNode instances
  2. creates two WeakReference objects pointing to the nodes, just to check their existence later
  3. initializes the policy
  4. clears the strong references to the second node, including those held by Mockito
  5. requests garbage collection
  6. verifies that the second node is collected (while the first, still strongly referenced, is not)

We checked that:

  1. In my local environment (Zulu 8.72.0.17-CA-macos-aarch64), this test succeeds, and if I revert the change from WeakHashMap back to the strong HashMap, it fails.
  2. Before all the strong references are cleared, inspecting the heap shows that the objects referring to weakNode2 are:
    a. weakNode2 and weakReference2
    b. InterceptedInvocations held by Mockito
    c. the HashMap of allNodes stored by when(metadata.getNodes()).thenReturn(allNodes);
    d. the wanted value in the Equals matcher created by await().atMost(10, TimeUnit.SECONDS).until(() -> weakReference2.get() == null);
    These are all expected, and no reference is leaked.
  3. If evaluateDistance does not return IGNORED for a node, the node will be stored in BasicLoadBalancingPolicy.liveNodes, but nodes there can be removed later by onDown or onRemoved. We assume this is intended.

We considered the following:

According to this post, System.gc() is only a request/hint that some JVMs will ignore, and there is no reliable way to force garbage collection. This means the test above may fail in other environments, and the last thing we want is a flaky test.

We think workarounds like generating a huge amount of garbage to trigger garbage collection may not be worth it, either.

Therefore, we concluded that having no test may be the best choice for now, and that the checks we performed above may be sufficient.

@hhughes (Contributor) left a comment

This change will allow entries to be dropped from LoadBalancingPolicyWrapper#distances, but I'm not entirely convinced there aren't other places where strong references to the Node object will remain.

In DefaultLoadBalancingPolicy, the responseTimes and upTimes maps use Node as the key in a ConcurrentHashMap, and I don't see where entries are ever removed, so these will likely continue to hold the references (although with upTimes it doesn't look like items are ever added).

ControlConnection maintains two weak hash maps - lastDistanceEvents and lastStateEvents - where both value types, DistanceEvent and NodeStateEvent, hold a strong reference to the Node; per the WeakHashMap docs it looks like this will prevent the entries from being cleaned up:

The value objects in a WeakHashMap are held by ordinary strong references. Thus care should be taken to ensure that value objects do not strongly refer to their own keys, either directly or indirectly, since that will prevent the keys from being discarded
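A minimal, self-contained sketch of the pattern the quote describes (the Key and ValueHoldingKey classes below are hypothetical, not driver code): because the map value strongly references its own key, the key stays reachable and the WeakHashMap entry is never reclaimed.

    import java.util.Map;
    import java.util.WeakHashMap;

    class Key {}

    class ValueHoldingKey {
      final Key key; // strong reference back to the map key
      ValueHoldingKey(Key key) { this.key = key; }
    }

    public class WeakHashMapLeakDemo {
      public static void main(String[] args) throws InterruptedException {
        Map<Key, ValueHoldingKey> map = new WeakHashMap<>();
        Key k = new Key();
        map.put(k, new ValueHoldingKey(k));
        k = null;       // drop our own strong reference to the key
        System.gc();    // only a hint, but usually enough for a demo
        Thread.sleep(100);
        // The entry typically survives: map -> value -> key keeps the key strongly reachable.
        System.out.println("entries after GC: " + map.size()); // usually prints 1
      }
    }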

Likely there are more places too.

I think it could be good to set up at least a one-off test which reproduces the leak from the original ticket and confirm that this change (and possibly the others mentioned above) successfully prevent the leak before marking this one as completed.

@hhughes (Contributor) left a comment

Looks good! A couple of minor pieces of feedback; most importantly, I think we want to avoid logging at info level when ignoring events that have lost their node reference, as this might end up creating a lot of unnecessary log churn.

return true;
AtomicLongArray array = responseTimes.getIfPresent(node);
if (array == null) return true;
else if (array.length() == 2) {
Contributor:

nit: consider collapsing the null check into this conditional so there is only one irregular state return value (return true)

policy.onUp(event.node);
DefaultNode node = event.node.get();
if (node == null) {
LOG.info("[{}] Node for this event was removed, ignoring: {}", logPrefix, event);
Contributor:

Info-level log might be a bit high for this notice, as there isn't really any action the user should take when this happens. Consider dropping to debug/trace.

if (event.newState == NodeState.UP) {
policy.onUp(event.node);
DefaultNode node = event.node.get();
if (node == null) {
Contributor:

nit: do we need to re-perform the null check for every policy? Is there a good reason not to pull this out of the loop?
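A rough sketch of the suggestion (the enclosing loop and the policies field are assumptions about the surrounding code, not taken from the PR):

    // Resolve the weak reference once, before notifying the policies.
    DefaultNode node = event.node.get();
    if (node == null) {
      LOG.debug("[{}] Node for this event was removed, ignoring: {}", logPrefix, event);
      return;
    }
    for (LoadBalancingPolicy policy : policies.values()) {
      if (event.newState == NodeState.UP) {
        policy.onUp(node);
      } else if (event.newState == NodeState.DOWN) {
        policy.onDown(node);
      }
    }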

@@ -53,10 +54,10 @@ public static NodeStateEvent removed(DefaultNode node) {
*/
public final NodeState newState;

public final DefaultNode node;
public final WeakReference<DefaultNode> node;
Contributor:

Nit: Since we're changing the type here, I'm wondering if it might be cleaner to provide a @Nullable getter for DefaultNode rather than exposing the weak reference directly. Same comment for DistanceEvent.node.
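A small sketch of the suggested shape (class and field names come from the PR; the getter, the public constructor, and the SpotBugs @Nullable import are illustrative assumptions, not the actual code):

    import edu.umd.cs.findbugs.annotations.Nullable;
    import java.lang.ref.WeakReference;

    public class NodeStateEvent {
      public final NodeState newState;
      private final WeakReference<DefaultNode> node;

      public NodeStateEvent(NodeState newState, DefaultNode node) {
        this.newState = newState;
        this.node = new WeakReference<>(node);
      }

      /** @return the node this event refers to, or null if it has been garbage collected. */
      @Nullable
      public DefaultNode getNode() {
        return node.get();
      }
    }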

context.getNodeStateListener().onRemove(event.node);
DefaultNode node = event.node.get();
if (node == null) {
LOG.info(
Contributor:

Nit: I think info-level is too high here

@@ -119,15 +120,22 @@ public NodeMetricUpdater newNodeUpdater(Node node) {
}

protected void processNodeStateEvent(NodeStateEvent event) {
DefaultNode node = event.node.get();
if (node == null) {
LOG.info(
Contributor:

Nit: I think info-level is too high here

@@ -121,16 +122,22 @@ public NodeMetricUpdater newNodeUpdater(Node node) {
}

protected void processNodeStateEvent(NodeStateEvent event) {
DefaultNode node = event.node.get();
if (node == null) {
LOG.info(
Contributor:

Nit: I think info-level is too high here

@hhughes (Contributor) left a comment

LGTM

@aratno (Contributor) left a comment

Jira link for my own reference: https://datastax-oss.atlassian.net/browse/JAVA-3051

It would be helpful to have logs from the repro that the user submitted, specifically the logs from com.datastax.oss.driver.internal.core.metadata.NodeStateManager.SingleThreaded#setState that look like "[{}] Transitioning {} {}=>{} (because {})".

This seems like a bug in AWS Keyspaces, since each node includes itself in system.peers, which the driver does not expect, according to the user's report:

[s0] Control node has an entry for itself in system.peers: this entry will be ignored. This is likely due to a misconfiguration; please verify your rpc_address configuration in cassandra.yaml on all nodes in your cluster.

I left a few comments but otherwise these seem like generally positive changes. I agree that it's difficult to write unit tests for memory leaks like these, especially without any scaffolding around heapdump capture or parsing. I'm a bit concerned that there may be paths where a node event may not have any strong reference and is then garbage-collected and ignored by handlers, rather than surviving long enough to serve its purpose.

@@ -82,6 +82,7 @@ public void markReady() {
consumer.accept(event);
}
} finally {
recordedEvents.clear();
Contributor:

What's the reasoning for this change?

Contributor Author:

ReplayingEventFilter works like a buffer: it accumulates a list (queue) of events while its state is STARTED, and consumes them all at once when its state becomes READY. However, the list of events was never cleared, so it leaked strong references to the nodes.
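A condensed sketch of that pattern and the fix, assuming the shape visible in the diff above (only recordedEvents, markReady, and the consumer loop are taken from it; the rest is illustrative):

    public void markReady() {
      state = State.READY;
      try {
        for (T event : recordedEvents) {
          consumer.accept(event); // replay everything buffered while STARTED
        }
      } finally {
        // The fix: drop the buffered events so they no longer pin the Node objects they reference.
        recordedEvents.clear();
      }
    }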

Comment on lines +176 to +177
clearMetrics();
cancelMetricsExpirationTimeout();
Contributor:

What's the reasoning for this change?

Contributor Author:

This is a lambda for a timeout. Even after the timeout fires and the lambda has run, the Timer object is not collected, and it still holds a reference to this until it is cancelled.
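A hedged sketch of the idea using Netty's Timer API (one plausible scheduling mechanism; the class, field, and method names below are illustrative, not the actual PR code):

    import io.netty.util.Timeout;
    import io.netty.util.Timer;
    import java.util.concurrent.TimeUnit;

    class MetricsHolder {
      private final Timer timer;
      private volatile Timeout metricsExpirationTimeout;

      MetricsHolder(Timer timer) {
        this.timer = timer;
      }

      void scheduleMetricsExpiration(long delayMinutes) {
        // The lambda captures `this`, so whatever retains the Timeout retains this object too.
        metricsExpirationTimeout =
            timer.newTimeout(t -> clearMetrics(), delayMinutes, TimeUnit.MINUTES);
      }

      void close() {
        Timeout timeout = metricsExpirationTimeout;
        if (timeout != null) {
          timeout.cancel(); // release the timer's reference to the task (and to `this`)
        }
        clearMetrics();
      }

      private void clearMetrics() {
        // drop metric objects
      }
    }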

long threshold = now - RESPONSE_COUNT_RESET_INTERVAL_NANOS;
long leastRecent = array.get(0);
return leastRecent - threshold < 0;
} else return true;
Contributor:

Style nit: Invert the condition and use an early-return if response rate is insufficient, so you don't have else return true

protected final Map<Node, Long> upTimes = new ConcurrentHashMap<>();
private final boolean avoidSlowReplicas;

public DefaultLoadBalancingPolicy(@NonNull DriverContext context, @NonNull String profileName) {
super(context, profileName);
this.avoidSlowReplicas =
profile.getBoolean(DefaultDriverOption.LOAD_BALANCING_POLICY_SLOW_AVOIDANCE, true);
CacheLoader<Node, AtomicLongArray> cacheLoader =
Contributor:

Style nit: use a separate class for the cache value here, rather than using AtomicLongArray as a generic container. Seems like it can be something like NodeResponseRateSample, with methods like boolean hasSufficientResponses. I see this was present in the previous implementation, so not a required change for this PR, just something I noticed.

@absurdfarce (Contributor)

Very much agreed that the underlying issue here appears to be with AWS Keyspaces, @aratno; that's being addressed in a different ticket. The scope of this change is preventing the (potentially indefinite) caching of Node instances within an LBP.

return array;
}
};
this.responseTimes = CacheBuilder.newBuilder().weakKeys().build(cacheLoader);
Contributor:

I think we should add a RemovalListener here.

If a GC happens and response times for a Node are purged, then we'll end up treating that as "insufficient responses" in isResponseRateInsufficient, which can lead us to mark a node as unhealthy. I recognize that this is a bit of a pathological example, but this behavior does depend on GC timing and would be a pain to track down, so adding logging could make someone's life easier down the line.

Contributor Author:

Thank you for your review. Would you please explain more about this?
If GC collects a node, that means the node is gone. If the node is gone, why do we care about whether it's treated as healthy or not?
Anyway, for the RemovalListener, do you mean something like this?

    this.responseTimes = CacheBuilder.newBuilder().weakKeys().removalListener(
            (RemovalListener<Node, AtomicLongArray>) notification -> 
                    LOG.trace("[{}] Evicting response times for {}: {}", 
                            logPrefix, notification.getKey(), notification.getCause()))
            .build(cacheLoader);

Contributor Author:

@aratno Hi Abe, thank you for your review. Is there any update?

@absurdfarce (Contributor) left a comment

Good work @SiyaoIsHiding! I'd like to talk about the weak references in events a bit more. There are a few other changes mentioned as well but the event references are really the big one for me.

@@ -53,14 +55,19 @@ public static NodeStateEvent removed(DefaultNode node) {
*/
public final NodeState newState;

public final DefaultNode node;
private final WeakReference<DefaultNode> node;
Contributor:

I don't think we need this change. Events of this type are very short-lived; they exist to communicate information between driver components that don't necessarily know about each other. They aren't stockpiled or stored in any meaningful way. You address the problem that was originally reported by changing the distances map in LoadBalancingPolicyWrapper to use weak refs... it's not at all clear to me that making the events use weak references buys you much on top of that.

Perhaps more importantly this change has the potential to make events a lot less useful. The driver uses events to notify components about changes in nodes, but if the actual node affected might not be present what use is the notification? Components that receive events without node information have no choice but to ignore them which means (a) every receiver has to check for null node information and (b) if you just ignore events without node data (which all these receivers appear to do) you'll get a lot more indeterminate behaviour since the presence or absence of node data in events is basically a random value (since it's determined by GC pressure which is essentially random from the perspective of the driver).

I note that the null checks referenced in (a) above are a big chunk of what's actually in this PR. If not for this constraint the change set would be a lot smaller. That wouldn't be the worst thing, as the presence of all those null checks obscures the more significant changes in at least a moderate way.

Contributor Author:

@absurdfarce I will try to revert the changes and see whether it still leaks, but:
There are two WeakHashMaps held by ControlConnection:

    private final Map<Node, DistanceEvent> lastDistanceEvents = new WeakHashMap<>();
    private final Map<Node, NodeStateEvent> lastStateEvents = new WeakHashMap<>();

However, both DistanceEvent and NodeStateEvent hold strong references to the associated node.
From the doc of WeakHashMap:

Implementation note: The value objects in a WeakHashMap are held by ordinary strong references. Thus care should be taken to ensure that value objects do not strongly refer to their own keys, either directly or indirectly, since that will prevent the keys from being discarded.

In this case DistanceEvent and NodeStateEvent are the values that hold strong references to the keys. Therefore, I think it may leak references.

Contributor Author:

Tested it out. If NodeStateEvent and DistanceEvent hold strong references to the node, then objects of these two kinds, together with the DefaultNode, will not be garbage collected.

Contributor Author:

Confirmed that if we make lastStateEvent and lastDistanceEvent two fields of DefaultNode, it works: DefaultNode and NodeStateEvent only point to each other (same for DistanceEvent), so objects of all three classes will be collected. However, I understand this is not an ideal solution.

Contributor:

What about this:

    private final Map<Node, NodeDistance> lastDistanceEvents = new WeakHashMap<>();
    private final Map<Node, NodeState> lastStateEvents = new WeakHashMap<>();

@@ -29,11 +31,16 @@
@Immutable
public class DistanceEvent {
public final NodeDistance distance;
public final DefaultNode node;
private final WeakReference<DefaultNode> node;
Contributor:

Same as discussed in NodeStateEvent; I don't think we need a WeakReference here.

"[{}] Evicting response times for {}: {}",
logPrefix,
notification.getKey(),
notification.getCause()))
Contributor:

@aratno I know you asked for this RemovalListener in a previous comment, under the assumption that it would help identify cases in which a node was (incorrectly) marked unhealthy because data had expired from the map. I'm going to argue instead that we shouldn't mark a node as unhealthy unless we have clear data indicating that it is so... which means the absence of response time data isn't enough to make it unhealthy. If we take that approach, would you still argue for a removal listener here? I'm of two minds about it myself; I can see the benefit, but it's possibly less interesting if the potentially damaging effects are mitigated.

Thoughts?

Contributor:

Also, a related question: do we know if RemovalListeners get called when weak keys are removed via GC? Javadoc seems to say yes but it might be useful to confirm.
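One way to check this empirically (a hedged sketch, not from the PR): in Guava, entries whose weak keys are collected do produce removal notifications with cause COLLECTED, but they are delivered lazily, when the cache next performs maintenance (e.g. during writes or an explicit cleanUp()).

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.RemovalListener;

    public class WeakKeyRemovalCheck {
      public static void main(String[] args) throws InterruptedException {
        Cache<Object, String> cache =
            CacheBuilder.newBuilder()
                .weakKeys()
                .removalListener(
                    (RemovalListener<Object, String>)
                        n -> System.out.println("removed, cause=" + n.getCause()))
                .build();

        Object key = new Object();
        cache.put(key, "value");
        key = null;       // drop the only strong reference to the key
        System.gc();      // hint only
        Thread.sleep(100);
        cache.cleanUp();  // notifications are delivered lazily, during maintenance
      }
    }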

if (sample == null) {
return true;
} else {
return !sample.hasSufficientResponses(now);
Contributor:

Style nit: the ternary operator does this a bit more cleanly:

return (sample == null) ? true : !sample.hasSufficientResponses(now)

Contributor:

Or even:

return sample == null || !sample.hasSufficientResponses(now)

}
NodeResponseRateSample sample = responseTimes.getIfPresent(node);
if (sample == null) {
return true;
Contributor:

I'm wondering if this is too aggressive now that we're not guaranteed that historical node response time data will always be present. I'm leaning towards an argument which says that we should only say a node's response time is insufficient if we have clear, unambiguous data which indicates that. That means that missing data (either because it hasn't been observed yet or GC pressure has removed it now that we're using weak refs) would not be enough to mark a node as unhealthy.

@adutra I'm very interested in your take on this; you have considerably more context with which to evaluate this question than I do. :)

Contributor:

Yes, I tend to agree that the absence of any samples should be interpreted as "we don't know".

Contributor Author:

I want to confirm that this means both "no oldest" and "no newest" will be considered "sufficient data". Right?

}
return sample;
}
};
Contributor:

You actually don't need a LoadingCache here; a simple Cache will do. A LoadingCache is useful when you need to create an entry for a key that's been requested but isn't in the cache yet. That's not quite what's going on here: your CacheLoader is loading values from the map, creating a new one if it isn't present, and then updating it. You're not doing anything to create the NodeResponseRateSample instance beyond calling the constructor if necessary.

You can accomplish the exact same thing using just a regular Cache by changing your update logic just a bit:

  protected void updateResponseTimes(@NonNull Node node) {
    try {

      responseTimes.get(node, NodeResponseRateSample::new).update();
    }
    catch (ExecutionException ee) {
      LOG.info("[{}] Exception updating node response times: {}", logPrefix, ee);
    }
  }

I note that this approach is even recommended in the Guava Javadoc.

Contributor:

Isn't your version subject to race conditions, though? In the loading cache version, NodeResponseRateSample.update() is called inside CacheLoader.load() and thus cannot be called concurrently. In your suggestion, update() is called after the value is retrieved from the cache, and thus could be executed concurrently.

Contributor:

Honestly I've gone round and round about this one over the past few days. I agree that my impl isn't right but a LoadingCache (or at least a LoadingCache used in the way it's used here) still feels wrong to me. I vastly prefer the approach of the original code which just leverages ConcurrentHashMap.compute() to do the right thing.

I think part of the mismatch for me here is that CacheLoader.load() doesn't provide the current value (if any) as an arg, so you wind up having to do this weird refresh() dance. I'm assuming part of the reason the current value isn't provided is that this isn't really the intended use case for a CacheLoader, but I digress. As mentioned, ConcurrentHashMap seems to model this case more effectively, and while that class doesn't support weak references directly we can get there indirectly with Guava's MapMaker: new MapMaker().weakKeys().makeMap() will give us what we want. The only thing we're missing then is the RemovalListener, but as mentioned elsewhere (a) I'm not sure if that's necessary now and (b) I'm not sure it'll do what we want anyway.

Contributor:

To make this more concrete: I got the relevant test case (DefaultLoadBalancingPolicyRequestTrackerTest) passing with the following changes:

  protected final ConcurrentMap<Node, NodeResponseRateSample> responseTimes;
  protected final Map<Node, Long> upTimes = new ConcurrentHashMap<>();
  private final boolean avoidSlowReplicas;

  public DefaultLoadBalancingPolicy(@NonNull DriverContext context, @NonNull String profileName) {
    super(context, profileName);
    this.avoidSlowReplicas =
        profile.getBoolean(DefaultDriverOption.LOAD_BALANCING_POLICY_SLOW_AVOIDANCE, true);
    this.responseTimes =
            new MapMaker().weakKeys().makeMap();
  }
...
  protected void updateResponseTimes(@NonNull Node node) {
    this.responseTimes.compute(node, (k,v) -> v == null ? new NodeResponseRateSample() : v.next());
  }
...
  // Exposed as protected for unit tests
  protected class NodeResponseRateSample {

    @VisibleForTesting
    protected final long oldest;
    @VisibleForTesting
    protected final OptionalLong newest;

    private NodeResponseRateSample() {

      long now = nanoTime();
      this.oldest = now;
      this.newest = OptionalLong.empty();
    }

    private NodeResponseRateSample(long oldestSample) {
      this(oldestSample, nanoTime());
    }

    private NodeResponseRateSample(long oldestSample, long newestSample) {
      this.oldest = oldestSample;
      this.newest = OptionalLong.of(newestSample);
    }

    @VisibleForTesting
    protected NodeResponseRateSample(AtomicLongArray times) {
      assert times.length() >= 1;
      this.oldest = times.get(0);
      this.newest = (times.length() > 1) ? OptionalLong.of(times.get(1)) : OptionalLong.empty();
    }

    // Our newest sample becomes the oldest in the next generation
    private NodeResponseRateSample next() {
      return new NodeResponseRateSample(this.getNewestValidSample(), nanoTime());
    }

    // If we have a pair of values return the newest, otherwise we have just one value... so just return it
    private long getNewestValidSample() { return this.newest.orElse(this.oldest); }

    // response rate is considered insufficient when less than 2 responses were obtained in
    // the past interval delimited by RESPONSE_COUNT_RESET_INTERVAL_NANOS.
    private boolean hasSufficientResponses(long now) {
      // If we only have one sample it's an automatic failure
      if (!this.newest.isPresent())
        return false;
      long threshold = now - RESPONSE_COUNT_RESET_INTERVAL_NANOS;
      return this.oldest - threshold >= 0;
    }
  }

times.set(0, times.get(1));
times.set(1, now);
}
}
Contributor:

This seems subject to race conditions to me, although I haven't been able to construct a specific scenario that triggers it. The old implementation computed a result array locally and then returned it; there was no changing of state at the ultimate destination (the old responseTimes map) beyond updating an entry or not. This implementation is doing more than that: we're actually changing the elements themselves (this.times changes in size) based on the data we receive.

It might be okay... but then again I'm also not sure I see a harm in just making this.times be an array of size two from the beginning.

Contributor:

As I stated previously, this will indeed race unless calls to update() are externally synchronized. I think the loading cache approach does guarantee that.

The idea of using an array of size 2 from the beginning could save us one array allocation – why not.


times.set(0, nanoTime());
}

// Only for unit tests
Contributor:

Maybe annotate with @VisibleForTesting?

@SiyaoIsHiding (Contributor Author)

We still need to figure out how NodeStateEvent and DistanceEvent should refer to their node. Other than that, the updated PR should have resolved all the other issues.

@SiyaoIsHiding (Contributor Author)

Solved the NodeState and NodeDistance problem as discussed. All the issues should be resolved now. I would appreciate a review :)

}
return array;
});
this.responseTimes.compute(node, (k, v) -> v == null ? new NodeResponseRateSample() : v.next());
Contributor:

Not sure if this was discussed already: the new design is undeniably more elegant, but it also allocates one new NodeResponseRateSample each time we access the map entry (because the entry value is immutable). In the old design the value was mutable in order to avoid allocations. Are we OK with this change?

Contributor:

My expectation is that it wouldn't matter, since we're creating (and GCing) pretty small objects and the JVM usually handles that quite well, but it's not unreasonable to confirm that.

@SiyaoIsHiding is this something you can test? I suppose if we can demonstrate roughly equivalent memory usage and throughput for a decent-sized load, that would be fine.

Contributor Author:

I tested with a client app using VisualVM.
Using this branch, NodeResponseRateSample uses up to 5000 live bytes with 170 live objects. The app finished 40000 requests in 154 seconds.
Using the current 4.x branch, a lambda in DefaultLoadBalancingPolicy uses up to 3600 live bytes with 200 live objects; the only lambda I can find there is in the responseTimes.compute call here. The app finished 40000 requests in 161 seconds.
So I think there isn't much of a performance difference.
